Quasi Static model
Introduction
We'll see a model which may be used with arithmetic coding, a range coder, or even static huffman. The Quasi Static model tries to get the best of the two main models used, static (also known as semi static) and adaptative, working in one pass but faster than an adaptative one (but slower than a really static one). This article will deal only with how to use it with arithmetic coding or a range coder.
Quasi Static model
For achieving one pass while adapting to the data, qs-model updates the probability of every symbol when it's processed, and at some intervals, it builds again the cumulative probabilities. It has a fixed maximum cumulative probability, which cannot be exceeded. If that's 100, then all the probabilities sum up to 100. How to do this? Let's say we want to sum up to 12 and we have 3 probabilities (we rather have to do three updates). Then we compute 12/3 = 4 and see that every new probability should have a value of 4 instead of 1. In the same way if we only have two elements, 12/2 = 6. In this case we can define the "unit" as 6.
But we may get problems, let's say we want it to sum up to 13 (the maximum cumulative probability) and then we do 13/4 = 3. Ok, but the sum of 4 probabilities with a value of 3 is 12 and not 13! Of course we could use a higher precision: 13/4 = 3.25, and this will make a maximum cumulative probability of 13. But using floating point is slow, can't we do it in another way? Sure we can, we compute the remainder of the division, 13%4 = 1 (the % operator is also called mod), and the first x values (where x is the remainder). Instead of having a value of 13/4, they have a value of (13/4)+1.
Let's say we have a maximum cumulative probability of 14 and want to assign probabilities to 4 values (4 different values or 4 times, it doesn't matter). Then we compute the value 14/4 = 3 and the remainder 14%4 = 2, so we know that the values to add are 4, 4, 3, 3, which sums up to 14.
The other main concept of the qs-model is the interval, that is how many symbols we have to process before we update the cumulative probabilities. We could choose a fixed one and use always the same. However, we want to adapt fast. In this case we start with a given interval (let's say 2) and a target interval (as an example 32) and then every time the real interval is finished, we multiply it by 2, and thus we make it bigger, till it's as big as the target interval. I.e: 2, 4, 8, 16, 32, 32, 32,... We don't want to have an interval greater than our target interval, because it can hurt compression.
Data structures: The qs-model keeps the probabilities of the input symbols and updates it for every symbol. For use with a range coder (or an arithmetic coder) it keeps a cumulative probability table, which is updated every interval. Every probability (not cumulative probability) is incremented, however as we have seen, our maximum cumulative probability is restricted, so instead we have to compute how much we have to add to every probability to sum up to the maximum. In two variables we keep the number of symbols to be seen until the next rescale, to deal with the remainder as explained above. One of them keeps track of the updates which have a probability of increment+1. The other two variables are rescaled (the current interval) and target rescaled (the target interval).
Due to the fact that we can choose a maximum cumulative probability which will remain always the same, we can make it 2^x, so in the range coder, we can replace a division with a shift right.
If we put everything together we have our Quasi Static model, of which pseudo code now follows.
Qs-model pseudo code
The Qs-model, like any other model, includes the basic operations, getting the cumulative probability of a symbol and updating the probability of a symbol. To keep the model you have some other routines, like restarting, rescaling, initializing and deleting, which deal with memory management and the variables of the model itself, which the coder doesn't have to care about.
In your implementation you can use structures for dealing with the variables, I recommend to do so (to keep more than one qs-model at the same time easily) however in the pseudo code this will not be shown, though there's no really great difference. Also this is independent of the language, and this is not the goal of pseudo code. If you want to see an example of it, look at the implementation of Michael Schindler (also see references).
And here it is the pseudo code.
Initqsmodel:
- N = 257 (alphabetsize, 256 bytes + 1 EoF, which we could not use)
- Targetrescale = 2000 (This is the maximum rescale, tune it to suit your needs)
- Cf [] = assign memory (Cumulative probabilities, updated every rescale)
- Newf [] = assign memory (The probabilities (frequencies) of every symbol, updated for every symbol seen)
- Cf [n] = 1 << lg_totf (This is the maximum cumulative probability, never changed; for a range coder: tot_f)
- Cf [0] = 0 (Init first cumulative probability)
- Resetqsmodel( ) (call the reset routine)
Variables used:
- N, this is the alphabet size, it's stored along with other variables to choose the increment (in case of 256 bytes).
- Targetrescale, the maximum rescale.
- Cf [], array with the cumulative probabilities, it has N entries.
- Newf [], array with the probabilities of the symbols, it also has N entries.
- Lg_totf. The logarithm (base 2) of the maximum cumulative probability. (12 is ok.)
Deleteqsmodel:
- Free memory of Cf []
- Free memory of Newf []
Resetqsmodel:
- Rescale = N>>4 OR 2 (Initial rescale, based on the alphabet size, to quickly adapt to the incoming data)
- Nextleft = 0 (Bytes seen, at start 0, obviously)
- Initval = tot_f / N (Define the "unit" of our model, how much to add to every probability to sum up to tot_f)
- End = tot_f MOD N (Compute the remainder)
- for ( i=0 ; i < end ; ++i ) (Init symbol probabilities, first "remainder" symbols with increment+1)
- Newf [i] = initvalue+1 (Init to symbolic 1 (tot_f / N) plus 1)
- for ( ; i < n ; ++i ) (The rest of the symbols only with tot_f / N)
- Newf [i] = initvalue
- dorescale( ) (Make cumulative probabilities and update variables of the qs-model)
Variables used:
- Rescale, the current interval, this will be doubled till it's equal to targetrescale.
- Initval, the increment, based on the alphabet size and maximum cumulative probability.
- Tot_f, maximum cumulative probability.
- End, the remainder, the number of symbols whose increment is bigger. (+1)
Dorescale:
- if ( nextleft != 0) (Check the remainder)
- incr++ (Next "nextleft" symbol will get more increment, to sum up to tot_f)
- left = nextleft ("Remainder" symbols get +1)
- nextleft = 0 (Next time do real rescale)
- end
- if ( rescale < targetrescale ) (Double rescale till it's not equal to targetrescale)
- rescale << 1 (Double, rescale*2)
- if ( rescale > targetrescale ) (Don't let it be greater than targetrescale)
- rescale = targetrescale
- cf = missing = cf [n] (Get tot_f)
- for ( i = (n-1) ; i != 0 ; --i ) (Visit all the probabilities but the first, i.e. build cumulative probabilities)
- Tmp = newf [i] (Probability of a given symbol)
- cf = cf - tmp (Cumulative probability, also to see if there was any bug while rescaling)
- Cf [i] = cf (Update cumulative probability)
- Tmp >> 1 OR 1 (Divide by two and ensure that it never gets below 1)
- Missing = missing - tmp (What we need to sum up to tot_f)
- Newf [i] = tmp (Rescale and update probability of the symbol)
- if ( cf != new f[0]) (Check for bug - useful for developers)
- Let user know bug and end (Or whatever you want to do)
- Newf [0] >>1 OR 1 (Rescale the first symbol, which doesn't get updated in the loop)
- Missing -= newf [0]
- Incr = missing/rescale (Compute increment for probabilities)
- Nextleft = missing MOD rescale (And the remainder)
- Left = rescale-nextleft (Updates without increment - I mean without +1)
Variables used:
- Cf, this is not the array but rather the cumulative probability. It's used for making the cumulative probabilities.
- Missing, this is the value needed to sum up to tot_f, having in mind that we have rescaled the probabilities.
- Tmp, temporary variable used to update the cumulative probability and probability of the symbol currently being rescaled.
- Incr, this is the value to add when updating a probability.
Qsupdate:
- if ( left <= 0 ) (Check whether we have to rescale or not)
- dorescale( )
- left-- (Decrement the value which keeps track of the interval till the next rescale)
- newf [sym] += incr (Update the probability)
Variables used:
- Left, it keeps track of the symbols to be seen till the next rescaling.
- Sym, this is the symbol whose probability we want to update.
- Incr, this is the value to add, computed at dorescale.
Qsmgetprobability:
- lt_f = Cf [sym]
- sy_f = Cf [sym+1] - lt_f
Variables used:
- Sy_f, the frequency (probability) of the symbol.
- Lt_f, the cumulative probability of the symbol, sometimes called the low range of a symbol.
- Sym, the symbol whose probabilities we are reading.
Qsmgetsymbol:
- Tmp = N (Get alphabetsize)
- While ( Cf[Tmp] < cump ) (While this search worked for me, it may not work for you, in this case debug it)
- --Tmp
- Return (Tmp-1) (Tmp is the symbol)
Variables used:
- Cump, this is the cumulative probability which the decoder got from low and range (or whatever you use).
About parameters: If you want to encode bytes, the targetrescal may be 2000, and the maximum cumulative probability (tot_f) may be 4096. Of course I recommend you to tune these parameters till they suit your needs. (A higher targetinterval will lead to faster encoding but probably worse compression.)
Also optionally we can initialize the probabilities to a fixed set of values, based on the output which we expect to get. However, due to the zero frequency problem, in most of the cases this will have a negative impact, even if the initial probabilities are carefully chosen.
When you get the symbol with a given cumulative probability, the worst case is O(alphabetsize) (in the case of 256 bytes) and the average should be around O(alphabetsize/2). However (like in Michael Schindler's implementation), we can use a binary search, which will give a worst case of O(Log2(alphabetsize)).
Closing words
The qs-model used as in the C code from Michael Schindler is a good way to code the output of another algorithm, match lengths and literals, like any of the lz family, and gather probabilities in different tables. Unfortunately it's not suitable for going to orders larger than 1 or maybe 2. The Qs-model can be adapted to further suit your needs, for example: If you have a maximum cumulative probability of 4096 and an alphabet size of 256 you don't need to deal with the remainder, because 4096%256=0.
You may wonder why you should use the qs-model instead of a static model if the static model is faster than the qs-model. Simple: A static model needs two passes, one to gather probabilities and the second to output codes, so it is not too decent for on the fly compression, and when you have two passes you need temp space. For these reasons, and also for the fact that the qs-model is very easy to plug in any compression, I prefer the qs-model.
Thanks to Michael Schindler for his help with the Quasi Static model.
References
Range coder package. From Michael Schindler. Available at http://www.compressconsult.com/rangecoder/.
Michael Schindler, private communication.